(Partially abridged from Alison Hill’s CS631 Labs)

What we’ll do

  • Use dplyr functions learned at the beginning of the course
  • Practice getting to know a new dataset
  • Use ggplot2 to recreate a few graphs
  • Create facetted plots with ggplot2

Data

We’ll use data from New York’s Museum of Modern Art (MoMA).

Data Cleaning

The journey of a data scientist begins with some housekeeping on the input data, which is rarely ready to use and usually needs to be cleaned and prepared before any analysis.

library(tidyverse)
library(janitor)

moma <- read_csv("data_artworks.csv", col_types = cols(BeginDate = col_number(),
    EndDate = col_number(), `Length (cm)` = col_number(), `Circumference (cm)` = col_number(),
    `Duration (sec.)` = col_number(), `Diameter (cm)` = col_number())) %>%
    clean_names()

problems(moma)

We do some basic cleaning of the gender variable with stringr. This variable records the gender of the artist (an empty () is used as a placeholder for “various artists”).

library(stringr)
moma <- moma %>%
    mutate(
        gender = str_replace_all(gender, fixed("(female)", ignore_case = TRUE), "F"),
        gender = str_replace_all(gender, fixed("(male)", ignore_case = TRUE), "M"),
        num_artists = str_count(gender, "[:alpha:]"),
        num_artists = na_if(num_artists, 0),
        n_female_artists = str_count(gender, "F"),
        n_male_artists = str_count(gender, "M"),
        artist_gender = case_when(
            num_artists == 1 & n_female_artists == 1 ~ "Female",
            num_artists == 1 & n_male_artists == 1 ~ "Male"
        )
    )
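To see what this cleaning step does, here is a minimal sketch on a handful of made-up gender strings (the toy tibble is an assumption for illustration, not real MoMA rows). It follows the same logic as above, using the bracketed character class [[:alpha:]] to count letters:

```r
library(dplyr)
library(stringr)

# Toy values standing in for the raw gender column (invented for illustration)
toy <- tibble(gender = c("(Male)", "(Female)", "(Male) (Female)", "()", NA))

toy_clean <- toy %>%
  mutate(
    # Collapse each bracketed label to a single letter
    gender = str_replace_all(gender, fixed("(female)", ignore_case = TRUE), "F"),
    gender = str_replace_all(gender, fixed("(male)", ignore_case = TRUE), "M"),
    # Each remaining letter stands for one artist; a count of 0 becomes NA
    num_artists = na_if(str_count(gender, "[[:alpha:]]"), 0L),
    n_female_artists = str_count(gender, "F"),
    n_male_artists = str_count(gender, "M"),
    # Only solo artists get a gender label; everything else stays NA
    artist_gender = case_when(
      num_artists == 1 & n_female_artists == 1 ~ "Female",
      num_artists == 1 & n_male_artists == 1 ~ "Male"
    )
  )

toy_clean
```

Note that the row with two artists and the empty () placeholder both end up with a missing artist_gender, which is why later steps filter on num_artists == 1.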

Let’s also do some detecting of strings in the credit_line variable.

moma <- moma %>%
    mutate(
        purchase = str_detect(credit_line, fixed("purchase", ignore_case = TRUE)),
        gift = str_detect(credit_line, fixed("gift", ignore_case = TRUE)),
        exchange = str_detect(credit_line, fixed("exchange", ignore_case = TRUE))
    )
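Here is a small sketch of how these three flags behave, using invented credit lines (the strings below are assumptions, not taken from the MoMA file). It shows that a single credit line can set more than one flag, and that a missing credit line yields NA rather than FALSE:

```r
library(dplyr)
library(stringr)

# Invented credit lines for illustration only
toy_credit <- tibble(
  credit_line = c(
    "Gift of the artist",
    "Purchase",
    "Acquired through exchange",
    "Purchase and partial gift",
    NA
  )
)

toy_flags <- toy_credit %>%
  mutate(
    # fixed() matches the literal text; ignore_case makes "Gift" match "gift"
    purchase = str_detect(credit_line, fixed("purchase", ignore_case = TRUE)),
    gift = str_detect(credit_line, fixed("gift", ignore_case = TRUE)),
    exchange = str_detect(credit_line, fixed("exchange", ignore_case = TRUE))
  )

toy_flags
```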

Let’s clean up some dates:

  • We’ll clean up year acquired with lubridate to pull out the year.
  • We’ll rename two date variables that are the artist birth/death year, but aren’t labelled clearly.
  • We’ll do a very rough estimate of the date each piece was created, using stringr::str_extract()

library(lubridate)
moma <- moma %>%
    mutate(year_acquired = year(date_acquired)) %>%
    rename(artist_birth_year = begin_date, artist_death_year = end_date) %>%
    mutate(
        year_created = str_extract(date, "\\d{4}"),
        artist_birth_year = na_if(artist_birth_year, 0),
        artist_death_year = na_if(artist_death_year, 0)
    )
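The following sketch walks through the same three ideas on made-up values (the toy dates and years are assumptions for illustration). One deliberate difference from the code above: here year_created is converted to an integer, whereas the lab keeps it as a character column and converts it later when plotting.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Invented values for illustration only
toy_dates <- tibble(
  date_acquired = as.Date(c("1936-01-15", "1970-06-01", NA)),
  date = c("c. 1935", "1929-30", "n.d."),
  begin_date = c(1893, 0, 1879)
)

toy_years <- toy_dates %>%
  mutate(
    # year() pulls the calendar year out of a Date
    year_acquired = year(date_acquired),
    # Keep the first run of four digits; strings without one become NA
    year_created = as.integer(str_extract(date, "\\d{4}")),
    # A birth year of 0 is treated as missing
    artist_birth_year = na_if(begin_date, 0)
  )

toy_years
```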

What different kinds of art classifications are available?

moma %>%
    distinct(classification) %>%
    print(n = Inf)
## # A tibble: 31 × 1
##    classification                
##    <chr>                         
##  1 Architecture                  
##  2 Mies van der Rohe Archive     
##  3 Design                        
##  4 Illustrated Book              
##  5 Print                         
##  6 Drawing                       
##  7 Film                          
##  8 Multiple                      
##  9 Periodical                    
## 10 Photograph                    
## 11 Painting                      
## 12 (not assigned)                
## 13 Architectural Model           
## 14 Product Design                
## 15 Video                         
## 16 Media                         
## 17 Performance                   
## 18 Sculpture                     
## 19 Photography Research/Reference
## 20 Software                      
## 21 Installation                  
## 22 Work on Paper                 
## 23 Audio                         
## 24 Textile                       
## 25 Ephemera                      
## 26 Collage                       
## 27 Film (object)                 
## 28 Frank Lloyd Wright Archive    
## 29 Poster                        
## 30 Graphic Design                
## 31 Furniture and Interiors

We want to focus on standard rectangular paintings:

  • Filter based on classification (“Painting”)
  • Drop all pieces of art that have missing (NA) height or width measurements, or that have 0 for either height or width.

library(tidyr)
moma <- moma %>%
    filter(classification == "Painting") %>%
    drop_na(height_cm, width_cm) %>%
    filter(height_cm > 0 & width_cm > 0)
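As a quick check of what this pipeline keeps, here is a sketch on four invented rows (toy data, not MoMA records). Only the first row survives all three steps. Since filter() already drops rows where the condition evaluates to NA, the drop_na() call mostly serves to make the intent explicit.

```r
library(dplyr)
library(tidyr)

# Invented rows for illustration only
toy_art <- tibble(
  classification = c("Painting", "Painting", "Painting", "Drawing"),
  height_cm = c(104.8, NA, 0, 30),
  width_cm = c(74.6, 50, 20, 40)
)

toy_paintings <- toy_art %>%
  filter(classification == "Painting") %>%   # drops the drawing
  drop_na(height_cm, width_cm) %>%           # drops the row with NA height
  filter(height_cm > 0 & width_cm > 0)       # drops the row with 0 height

toy_paintings
```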

We focus only on a subset of columns:

moma <- moma %>%
    select(title, contains("artist"), contains("year"), contains("_cm"),
        purchase, gift, exchange, classification, department)

Now let’s export this data frame, in case we want to start right away from the cleaned data.

write_csv(moma, "artworks-cleaned.csv")

Read in the data

As you can see, we did a lot of cleaning and decision-making in the pre-processing. The data we have now contain only paintings in the MoMA collection, since we filtered on classification == "Painting".

If you start working from the cleaned data, you just load them from the saved CSV file:

library(here)
library(readr)
library(dplyr)
moma <- read_csv("artworks-cleaned.csv")

Know the data

You cleaned and prepared the data: now it’s time to know your data and start asking questions.

For example:

  1. How many paintings (rows) are in moma? How many variables (columns) are in moma?
  2. What is the first painting acquired by MoMA? Which year? Which artist? What title?
  3. What is the oldest painting in the collection? Which year? Which artist? What title?
  4. How many distinct artists are there?
  5. Which artist has the most paintings in the collection? How many paintings are by this artist?
  6. How many paintings by male vs female artists?

And more:

  1. How many artists of each gender are there?
  2. In what year were the most paintings acquired? Created?
  3. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?

Let’s see how we can answer some of these questions!

How many paintings?

  • How many rows/observations are in moma?
  • How many variables are in moma?

These questions can be answered, for example, using the dplyr::glimpse() function.

moma
glimpse(moma)
## Rows: 2,253
## Columns: 23
## $ title             <chr> "Rope and People, I", "Fire in the Evening", "Portra…
## $ artist            <chr> "Joan Miró", "Paul Klee", "Paul Klee", "Pablo Picass…
## $ artist_bio        <chr> "(Spanish, 1893–1983)", "(German, born Switzerland. …
## $ artist_birth_year <dbl> 1893, 1879, 1879, 1881, 1880, 1879, 1943, 1880, 1839…
## $ artist_death_year <dbl> 1983, 1940, 1940, 1973, 1946, 1953, 1977, 1950, 1906…
## $ num_artists       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ n_female_artists  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_male_artists    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ artist_gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ year_acquired     <dbl> 1936, 1970, 1966, 1955, 1939, 1968, 1997, 1931, 1934…
## $ year_created      <chr> "1935", "1929", "1927", "1919", "1925", "1919", "197…
## $ circumference_cm  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ depth_cm          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ diameter_cm       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ height_cm         <dbl> 104.8, 33.8, 60.3, 215.9, 50.8, 129.2, 200.0, 54.6, …
## $ length_cm         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ width_cm          <dbl> 74.6, 33.3, 36.8, 78.7, 54.0, 89.9, 200.0, 38.1, 96.…
## $ seat_height_cm    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ purchase          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ gift              <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, F…
## $ exchange          <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…
## $ classification    <chr> "Painting", "Painting", "Painting", "Painting", "Pai…
## $ department        <chr> "Painting & Sculpture", "Painting & Sculpture", "Pai…

There are 2253 paintings (rows) and 23 variables (columns) in moma.

What is the first painting acquired?

  • What is the first painting acquired by MoMA (since they started tracking)?
  • What year was it acquired?
  • Which artist?
  • What title?

Hint: These questions can be answered by combining two dplyr functions, select and arrange.

moma %>%
    select(artist, title, year_acquired) %>%
    arrange(year_acquired)

What is the oldest painting in the MoMA collection?

  • What is the oldest painting in the MoMA collection historically (since they started tracking)?
  • What year was it created?
  • Which artist?
  • What title?

Again, these questions can be answered by combining the dplyr functions select and arrange.

moma %>%
    select(artist, title, year_created) %>%
    arrange(year_created)
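If you only want the single earliest row instead of the whole sorted table, one option is dplyr::slice_min(). The sketch below uses invented rows (toy titles and years are assumptions), and converts year_created to an integer first because it is stored as text:

```r
library(dplyr)

# Invented rows for illustration only
toy_moma <- tibble(
  title = c("Work A", "Work B", "Work C"),
  year_acquired = c(1936, 1931, 1970),
  year_created = c("1935", "1925", "1919")  # character, as str_extract() returns
)

# Earliest acquisition: keep the row with the smallest year_acquired
first_acquired <- toy_moma %>%
  slice_min(year_acquired, n = 1)

# Oldest work: convert the text years to integers before ranking
oldest <- toy_moma %>%
  mutate(year_created = as.integer(year_created)) %>%
  slice_min(year_created, n = 1)

first_acquired
oldest
```

By default slice_min() keeps ties, so you may get more than one row back if several works share the earliest year.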

Using inline code, I could then report that the oldest painting is Landscape at Daybreak, painted by Odilon Redon in 1872.

How many artists?

  • How many distinct artists are there?

moma %>%
    distinct(artist)

Pro tip: You could add a tally() too to get just the number of rows. You can also then use pull() to get that single number out of the tibble:

num_artists <- moma %>%
    distinct(artist) %>%
    # tally() is short for df %>% summarise(n = n())
    tally() %>%
    pull()
num_artists
## [1] 989
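Another way to get the same single number is dplyr::n_distinct(). Here is a sketch on a toy vector of artist names (the four rows are invented for illustration) showing that both routes agree:

```r
library(dplyr)

# Invented rows for illustration only
toy_artists <- tibble(
  artist = c("Paul Klee", "Paul Klee", "Pablo Picasso", "Sherrie Levine")
)

# Route 1: distinct rows, then count them and pull out the number
via_tally <- toy_artists %>%
  distinct(artist) %>%
  tally() %>%
  pull()

# Route 2: count distinct values directly inside summarise()
via_n_distinct <- toy_artists %>%
  summarise(n = n_distinct(artist)) %>%
  pull(n)

via_tally
via_n_distinct
```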

Then I can refer to this number in inline code chunks like: there are 989 total.

Which artist has the most paintings?

  • Which artist has the most paintings ever owned by MoMA?
  • How many paintings in the MoMA collection by that artist?

moma %>%
    count(artist, sort = TRUE)

In the ?count documentation, it says: “count and tally are designed so that you can call them repeatedly, each time rolling up a level of detail.” Try running count() again (leave parentheses empty) on your last code chunk.

moma %>%
    count(artist, sort = TRUE) %>%
    count()

How many paintings by male vs female artists?

moma %>%
    count(artist_gender)

Now we’ll count the number of artists by gender. You’ll need to give count two variable names in the parentheses: artist_gender and artist.

moma %>%
    count(artist_gender, artist, sort = TRUE)

This output is not super helpful as we already know that Pablo Picasso has 55 paintings in the MoMA collection. But how can we find out which female artist has the most paintings? We have a few options. Let’s first add a filter for females.

moma %>%
    count(artist_gender, artist, sort = TRUE) %>%
    filter(artist_gender == "Female")

Another option is to use another dplyr function called top_n(). Use ?top_n to see how it works. Or how it won’t work, in this context:

moma %>%
    count(artist_gender, artist, sort = TRUE) %>%
    top_n(2)

How it will work better is following a group_by(artist_gender):

moma %>%
    count(artist_gender, artist, sort = TRUE) %>%
    group_by(artist_gender) %>%
    top_n(1)
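In recent dplyr versions top_n() is superseded by slice_max(), which makes the ranking variable explicit. A minimal sketch of the grouped version, on an invented count table (artist names and counts are placeholders, not MoMA results):

```r
library(dplyr)

# Invented counts for illustration only
toy_counts <- tibble(
  artist_gender = c("Male", "Male", "Female", "Female"),
  artist = c("Artist A", "Artist B", "Artist C", "Artist D"),
  n = c(5L, 3L, 4L, 1L)
)

top_by_gender <- toy_counts %>%
  group_by(artist_gender) %>%
  # Within each gender, keep the row with the largest n
  slice_max(n, n = 1) %>%
  ungroup()

top_by_gender
```

The first n names the column to rank by and the second is the number of rows to keep per group.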

Now we can see that Sherrie Levine has 12 paintings. This is a pretty far cry from the 55 paintings by Pablo Picasso.

How many artists of each gender are there?

This is a harder question to answer than you think! This is because the level of observation in our current moma dataset is unique paintings. We have multiple paintings done by the same artists though, so counting just the number of unique paintings is different than counting the number of unique artists.

Remember how count can be used back-to-back to roll up a level of detail? We try that by running count(artist_gender) again on the last code chunk.

moma %>%
    count(artist_gender, artist) %>%
    count(artist_gender)

This output takes the previous table (made with count(artist_gender, artist)) and essentially ignores its n column, so we no longer care how many paintings each individual artist created. Instead, we count the rows of this new table, where each row is a unique artist. By counting by artist_gender in the last line, we group by the levels of that variable (Female/Male/NA), and the new count column (nn in the original output; recent dplyr versions may name it n) is the number of unique artists recorded for each gender category.
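If the back-to-back count() feels opaque, an equivalent route is to reduce the data to one row per artist with distinct() and then count once, naming the result explicitly. This is a sketch on invented rows (n_artists is a column name chosen here for illustration):

```r
library(dplyr)

# Invented rows for illustration only: one row per painting
toy_paintings <- tibble(
  artist = c("Artist A", "Artist A", "Artist B", "Artist C", "Artist C", "Artist C"),
  artist_gender = c("Male", "Male", "Female", "Male", "Male", "Male")
)

artists_by_gender <- toy_paintings %>%
  # One row per unique artist (with their gender)
  distinct(artist_gender, artist) %>%
  # Count artists, not paintings, and give the column a clear name
  count(artist_gender, name = "n_artists")

artists_by_gender
```

Six paintings collapse to three artists here: one female and two male.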

When were the most paintings in the collection acquired?

This is another job for dplyr::count, which we can also use to sort by the counts:

moma %>%
    count(year_acquired, sort = TRUE)

When were the most paintings in the collection created?

moma %>%
    count(year_created, sort = TRUE)

What about the first painting by a solo female artist?

To answer this question, we combine filter, select, and arrange from dplyr.

When was the first painting by a solo female artist acquired?

moma %>%
    filter(num_artists == 1 & n_female_artists == 1) %>%
    select(title, artist, year_acquired, year_created) %>%
    arrange(year_acquired)

What is the oldest painting by a solo female artist, and when was it created?

moma %>%
    filter(num_artists == 1 & n_female_artists == 1) %>%
    select(title, artist, year_acquired, year_created) %>%
    arrange(year_created)

Visualization

Plot year painted vs year acquired

Let’s recreate this plot from fivethirtyeight (mostly)!

Things to consider:

  • You’ll want to play around with setting an alpha value here - keep in mind that 0 is totally transparent and 1 is opaque.
  • Try using geom_abline() to add the line in red (use the default intercept value of 0). The actual red line is difficult to recreate. Here is what the authors say: “The red regression line shows the ‘modernizing’ of MoMA’s collection — how quickly the museum has moved toward acquiring recent paintings.”
  • Change the x- and y-axis labels and the plot title to match the plot above

ggplot(moma, aes(as.numeric(year_created), as.numeric(year_acquired))) +
    geom_point(alpha = 0.3, na.rm = TRUE) +
    # y = x reference line: intercept 0, slope 1
    geom_abline(intercept = 0, slope = 1, colour = "red") +
    labs(x = "Year Painted", y = "Year Acquired",
        title = "MoMA Keeps Its Collection Current",
        subtitle = "Year of a work's acquisition vs. year it was painted")
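To make the role of the red line concrete, here is a small sketch with invented points (the toy years are assumptions for illustration). A line with intercept 0 and slope 1 is the y = x diagonal: a painting sitting on it was acquired in the year it was painted, and the higher a point sits above it, the longer MoMA waited to acquire the work.

```r
library(ggplot2)

# Invented points for illustration only
toy_years <- data.frame(
  year_created = c(1900, 1925, 1950, 1975),
  year_acquired = c(1935, 1940, 1960, 1980)
)

p <- ggplot(toy_years, aes(x = year_created, y = year_acquired)) +
  geom_point(alpha = 0.3) +
  # y = x reference line: acquired in the same year it was painted
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  labs(x = "Year Painted", y = "Year Acquired")

p
```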

Facet by artist gender

Can you make the same plot above, but facet by artist gender?

For this to make sense, you probably want to do some filtering to select only those paintings where there was one “solo” artist.

moma_solo <- moma %>%
    filter(num_artists == 1)

ggplot(moma_solo, aes(as.numeric(year_created), as.numeric(year_acquired))) +
    geom_point(alpha = 0.1) +
    geom_abline(intercept = 0, slope = 1, colour = "red") +
    labs(x = "Year Painted", y = "Year Acquired") +
    ggtitle("MoMA Keeps Its Collection Current") +
    facet_wrap(~artist_gender)

Plot painting dimensions

Let’s (somewhat) try to recreate this scatterplot from fivethirtyeight:

Some things to consider:

  • Try filtering all paintings with height less than 600 cm and width less than 760 cm.
  • If you want to add color as in the original, you’ll need to create a new variable using mutate.

Hint: You’ll probably also want to look into case_when to create a categorical variable “on the fly” to use for coloring.

moma_dim <- moma %>%
    filter(height_cm < 600, width_cm < 760) %>%
    mutate(hw_ratio = height_cm/width_cm, hw_cat = case_when(hw_ratio >
        1 ~ "taller than wide", hw_ratio < 1 ~ "wider than tall",
        hw_ratio == 1 ~ "perfect square"))
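Here is the same case_when() logic on three invented paintings (toy dimensions, one per category), so you can see which label each ratio produces:

```r
library(dplyr)

# Invented dimensions for illustration only
toy_dim <- tibble(
  height_cm = c(100, 50, 80),
  width_cm = c(50, 100, 80)
)

toy_cat <- toy_dim %>%
  mutate(
    hw_ratio = height_cm / width_cm,
    # Conditions are checked in order; the first TRUE one decides the label
    hw_cat = case_when(
      hw_ratio > 1 ~ "taller than wide",
      hw_ratio < 1 ~ "wider than tall",
      hw_ratio == 1 ~ "perfect square"
    )
  )

toy_cat
```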

library(ggthemes)  # to load the fivethirtyeight theme

ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
    geom_point(alpha = 0.5) + ggtitle("MoMA Paintings, Tall and Wide") +
    scale_colour_manual(name = "", values = c("gray50", "#FF9900",
        "#B14CF0")) + theme_fivethirtyeight() + theme(axis.title = element_text()) +
    labs(x = "Width", y = "Height")

We can do better with colors!

ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
    geom_point(alpha = 0.5) + ggtitle("MoMA Paintings, Tall and Wide") +
    scale_colour_manual(name = "", values = c("gray50", "#ee5863",
        "#6999cd")) + theme_fivethirtyeight() + theme(axis.title = element_text()) +
    labs(x = "Width", y = "Height")
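In the plots above the colours are matched to the categories by position, which follows the alphabetical order of hw_cat. One way to make the mapping explicit is to pass a named vector to scale_colour_manual(). This is a sketch on invented points; theme_minimal() is used here only so the example does not depend on ggthemes.

```r
library(ggplot2)

# Invented points for illustration only
toy_dim <- data.frame(
  width_cm = c(50, 100, 80),
  height_cm = c(100, 50, 80),
  hw_cat = c("taller than wide", "wider than tall", "perfect square")
)

# Names tie each colour to a category, regardless of ordering
hw_colours <- c(
  "perfect square" = "gray50",
  "taller than wide" = "#ee5863",
  "wider than tall" = "#6999cd"
)

p_named <- ggplot(toy_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
  geom_point(alpha = 0.5) +
  scale_colour_manual(name = NULL, values = hw_colours) +
  theme_minimal() +
  labs(x = "Width", y = "Height")

p_named
```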

We could also remove the legend and use an annotation layer instead:

ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
    geom_point(alpha = 0.5, show.legend = FALSE) + ggtitle("MoMA Paintings, Tall and Wide") +
    scale_colour_manual(name = "", values = c("gray50", "#ee5863",
        "#6999cd")) + theme_fivethirtyeight() + theme(axis.title = element_text()) +
    labs(x = "Width", y = "Height") + annotate(x = 200, y = 380,
    geom = "text", label = "Taller than\nWide", color = "#ee5863",
    size = 5, hjust = 1, fontface = 2) + annotate(x = 375, y = 100,
    geom = "text", label = "Wider than\nTall", color = "#6999cd",
    size = 5, hjust = 0, fontface = 2)

Plot something new & different!

It can be anything - you can change colors, add annotations, switch the geoms, add new variables to examine. The only requirements are:

  1. You make one new plot that is original, and
  2. You write 1-2 sentences to present the plot and why it makes sense. What questions do you think your plot can help you to answer?

It does not have to be publication-ready right now, but it should make sense as a visualization.